The city of San Francisco was one of the earliest responders to the COVID-19 pandemic in the United States, issuing a stay-at-home order to residents on March 16, 2020. The state of California followed with a statewide stay-at-home order on March 19, 2020. Due to its early response and strict guidelines, San Francisco is one of the large metropolitan areas in the US that has kept COVID-19 largely under control, with a relatively low number of cases and deaths compared to its population (7,000 cases and 64 deaths out of over 800,000 residents).
This project aims to understand the relationship between confirmed COVID-19 cases and San Francisco neighborhoods. As the city continues to re-open in recent months, it is imperative to understand the relationship between the number of confirmed COVID-19 cases and a neighborhood's composition, particularly its venues. Under the assumption that most individuals are infected outside of their home, we can treat each venue as a potential site of infection. In doing so, we can analyze the relationship between the types and numbers of venues in a neighborhood and its case count.
The results of this analysis would be invaluable for local policymakers looking to understand the impact of re-opened venues on COVID-19 cases. This will inform them in shaping re-opening policy for the city in order to maintain public safety while still stimulating the local economy.
In order to correlate San Francisco COVID-19 cases and venues, we will be using two data sources: Foursquare and DataSF.
Foursquare is a location technology platform that provides information on venues. It uses crowdsourced data to describe venues around a point of interest. For each venue, this information includes its name, location (latitude and longitude), and category.
DataSF publishes open datasets from the city departments of San Francisco. The dataset we will be using details confirmed COVID-19 case and death counts, summarized by geography.
# Import the necessary packages
import pandas as pd
from sodapy import Socrata
import numpy as np
import itertools
import requests
from pandas import json_normalize
from geopy.geocoders import Nominatim
import googlemaps
import folium
from sklearn.cluster import KMeans
import matplotlib.cm as cm
import matplotlib.colors as colors
import matplotlib.pyplot as plt
In order to perform the following analysis, we first need to visualize the data to get a sense of what is happening in San Francisco. To do so, we will first visualize the COVID-19 cases in each San Francisco neighborhood using a heat map. This will tell us where the hotspots are within the city.
From here, we will look at the venue data provided by Foursquare and examine the top venues in each neighborhood. This will give us a sense of what is popular and where people would congregate if they were to go out in these neighborhoods. These top venues would serve as the most probable sites of infection should one occur in San Francisco.
For this project, we will be using the "COVID-19 Cases and Deaths Summarized by Geography" dataset, which is provided by the San Francisco Department of Public Health. DataSF provides an API endpoint for users to download this dataset directly. The data is segmented by zip code, neighborhood, and census district. For the purpose of this analysis, we will focus on the subset detailing neighborhood cases.
client = Socrata("data.sfgov.org", None)
# First 2000 results, returned as JSON from API / converted to Python list of
# dictionaries by sodapy.
results = client.get("tpyr-dvnc", limit=2000)
# Convert to pandas DataFrame
results_df = pd.DataFrame.from_records(results)
# Isolate the rows segmented by neighborhood
covid_df = results_df[results_df['area_type'] == 'Analysis Neighborhood']
# Drop the last three (unneeded) columns
covid_df = covid_df.drop(columns=covid_df.columns[-3:])
covid_df.fillna(0, inplace=True)
covid_df.reset_index(drop=True, inplace=True)
covid_df['count'] = covid_df['count'].astype(int)
covid_df['deaths'] = covid_df['deaths'].astype(int)
covid_df.head()
From the COVID-19 data, we obtain a list of San Francisco neighborhoods. The next step would be to find the coordinates of each neighborhood.
neigh = covid_df['id'].unique()
for n in neigh:
    print(n)
print("There are {} neighborhoods in San Francisco!".format(len(neigh)))
# The code was removed by Watson Studio for sharing.
lat = []
lng = []
for n in neigh:
    geocode_result = gmaps.geocode(n + ' San Francisco')
    coord = geocode_result[0]['geometry']['location']
    lat.append(coord['lat'])
    lng.append(coord['lng'])
covid_df['Latitude'] = lat
covid_df['Longitude'] = lng
covid_df.head()
Using Folium, we can visualize the COVID-19 cases by neighborhood to see which neighborhoods have the highest number of cases. From the heat map below, we see that the Mission and Bayview Hunters Point have the highest numbers of COVID-19 cases thus far.
# Create a map of SF and display it
sf_map = folium.Map(location=[37.77, -122.42], zoom_start=12)
sf_geojson_url = 'https://data.sfgov.org/api/geospatial/p5b7-5n3h?method=export&format=GeoJSON'
sf_geo = requests.get(sf_geojson_url).json()
folium.Choropleth(
    geo_data=sf_geo,
    data=covid_df[['id', 'count']],
    columns=['id', 'count'],
    key_on='feature.properties.nhood',
    fill_color='YlOrRd',
    fill_opacity=0.7,
    line_opacity=0.2,
    legend_name='COVID-19 Cases'
).add_to(sf_map)
# Add markers to map
for lat, lng, neighborhood in zip(covid_df['Latitude'], covid_df['Longitude'], covid_df['id']):
    label = folium.Popup(str(neighborhood), parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7
    ).add_to(sf_map)
# Display map
sf_map
Now that we have an idea of the number of COVID-19 cases in San Francisco neighborhoods, let us use the Foursquare API to understand the kinds of venues that exist within these neighborhoods. Here, we will call the API to find the top 100 venues near each neighborhood, encode them using one-hot encoding, and then list out the 10 most popular venue categories for each neighborhood.
# The code was removed by Watson Studio for sharing.
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    venues_list = []
    for name, lat, lng in zip(names, latitudes, longitudes):
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID,
            CLIENT_SECRET,
            VERSION,
            lat,
            lng,
            radius,
            LIMIT)
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        # keep only the relevant information for each nearby venue
        venues_list.append([(
            name,
            lat,
            lng,
            v['venue']['name'],
            v['venue']['location']['lat'],
            v['venue']['location']['lng'],
            v['venue']['categories'][0]['name']) for v in results])
    # flatten the per-neighborhood lists into a single DataFrame
    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood',
                             'Neighborhood Latitude',
                             'Neighborhood Longitude',
                             'Venue',
                             'Venue Latitude',
                             'Venue Longitude',
                             'Venue Category']
    return nearby_venues
sf_venues = getNearbyVenues(names=covid_df['id'],
latitudes=covid_df['Latitude'],
longitudes=covid_df['Longitude'])
sf_venues.head()
venue_count_df = sf_venues[['Venue Category', 'Venue']].groupby('Venue Category').count()
venue_count_df.sort_values(by=['Venue'], inplace=True, ascending=False)
venue_count_df.reset_index(inplace=True)
print('There are {} unique categories.'.format(len(sf_venues['Venue Category'].unique())))
print('The most common venue is ' + venue_count_df['Venue Category'].values[0])
print('The least common venue is ' + venue_count_df['Venue Category'].values[-1])
# one hot encoding
sf_onehot = pd.get_dummies(sf_venues[['Venue Category']], prefix="", prefix_sep="")
# add neighborhood column back to dataframe
sf_onehot['Neighborhood'] = sf_venues['Neighborhood']
# move neighborhood column to the first column
fixed_columns = [sf_onehot.columns[-1]] + list(sf_onehot.columns[:-1])
sf_onehot = sf_onehot[fixed_columns]
sf_onehot.head()
sf_grouped = sf_onehot.groupby('Neighborhood').mean().reset_index()
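As a toy illustration of what the one-hot-and-mean step produces (with made-up venues and neighborhoods, not the Foursquare data), the mean of the dummy columns within each neighborhood is simply the frequency of each venue category there:

```python
import pandas as pd

# Hypothetical toy data standing in for sf_venues
toy = pd.DataFrame({
    'Neighborhood': ['Mission', 'Mission', 'Marina'],
    'Venue Category': ['Bar', 'Park', 'Bar'],
})

# One-hot encode the categories, then re-attach the neighborhood column
onehot = pd.get_dummies(toy[['Venue Category']], prefix='', prefix_sep='')
onehot['Neighborhood'] = toy['Neighborhood']

# Mean of the dummies = frequency of each category per neighborhood
grouped = onehot.groupby('Neighborhood').mean().reset_index()
print(grouped)
```

Here the Mission row comes out as 0.5 for Bar and 0.5 for Park, since each category accounts for half of its two venues.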
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]
num_top_venues = 10
indicators = ['st', 'nd', 'rd']
# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))
# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = sf_grouped['Neighborhood']
for ind in np.arange(sf_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(sf_grouped.iloc[ind, :], num_top_venues)
neighborhoods_venues_sorted.head(len(neigh))
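As a quick sanity check, `return_most_common_venues` can be exercised on a hand-made frequency row (the values below are toy numbers, not the real Foursquare frequencies); the function is reproduced here so the sketch is self-contained:

```python
import pandas as pd

def return_most_common_venues(row, num_top_venues):
    # Skip the leading Neighborhood entry, then sort category frequencies descending
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]

# Toy row: neighborhood label followed by category frequencies
row = pd.Series({'Neighborhood': 'Mission', 'Bar': 0.5, 'Park': 0.3, 'Gym': 0.2})
print(return_most_common_venues(row, 2))  # expect Bar then Park
```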
Now that we have visualized the number of confirmed COVID-19 cases and the venues in each neighborhood, we can see if there is any correlation between them. The simplest way to do this is to segment the neighborhoods into clusters. Clustering will group similar neighborhoods together based on the dataset of interest; in this case, we are interested in seeing which neighborhoods are similar based on their local venues.
From here, we can take a look to see if there is any correlation between the clusters we identified and COVID cases.
Now that we have identified the number of confirmed COVID-19 cases and the venues in each neighborhood, let us cluster the neighborhoods to see which are more similar to each other. From this, we can see if the concentration of COVID-19 cases is related to the venues of a particular neighborhood. To do so, we will cluster the neighborhoods based on the venue information using k-means.
# set number of clusters
kclusters = 5
sf_grouped_clustering = sf_grouped.drop(columns='Neighborhood')
# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(sf_grouped_clustering)
# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]
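The choice of k = 5 above is a judgment call; one common way to sanity-check it is the elbow method. The sketch below runs it on synthetic frequency vectors as a stand-in (in the notebook, `sf_grouped_clustering` would be passed instead of the random matrix):

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the venue-frequency matrix (rows = neighborhoods)
rng = np.random.default_rng(0)
X = rng.random((40, 10))

# Inertia (within-cluster sum of squares) for a range of k values;
# the "elbow" where the curve flattens suggests a reasonable k
inertias = []
for k in range(1, 10):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)
print(inertias)
```

Plotting `inertias` against k and looking for the bend would indicate whether 5 clusters is a sensible choice for the real data.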
# Add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)
sf_merged = covid_df.copy()  # copy so later edits do not mutate covid_df
sf_merged = sf_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='id')
sf_merged.dropna(subset=['Cluster Labels'], axis=0, inplace=True)
sf_merged.head()
# create map
map_clusters = folium.Map(location=[37.77, -122.42], zoom_start=12)
folium.Choropleth(
    geo_data=sf_geo,
    data=covid_df[['id', 'count']],
    columns=['id', 'count'],
    key_on='feature.properties.nhood',
    fill_color='YlOrRd',
    fill_opacity=0.7,
    line_opacity=0.2,
    legend_name='COVID-19 Cases'
).add_to(map_clusters)
# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]
# add markers to the map
for lat, lon, poi, cluster in zip(sf_merged['Latitude'], sf_merged['Longitude'], sf_merged['id'], sf_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[int(cluster)],
        fill=True,
        fill_color=rainbow[int(cluster)],
        fill_opacity=0.7).add_to(map_clusters)
map_clusters
Now that we have clustered the neighborhoods, we can compare the COVID-19 cases per cluster. First, let us visualize the number of cases in each cluster. To do so, we will generate a box-and-whisker plot showing the spread of COVID-19 cases by cluster. From the graph below, it looks like there is no relation between the clusters and the number of cases, since the counts vary widely within each group.
clustered_data = sf_merged[['id','count','Cluster Labels']]
fig = plt.figure(figsize=(10, 10))
ax = fig.add_subplot(111)
ax.set_title("Confirmed COVID-19 Cases by Clusters", fontsize=20)
data = [clustered_data['count'][clustered_data['Cluster Labels'] == 0],
clustered_data['count'][clustered_data['Cluster Labels'] == 1],
clustered_data['count'][clustered_data['Cluster Labels'] == 2],
clustered_data['count'][clustered_data['Cluster Labels'] == 3],
clustered_data['count'][clustered_data['Cluster Labels'] == 4]]
ax.boxplot(data,
labels= ['Cluster 0', 'Cluster 1', 'Cluster 2', 'Cluster 3', 'Cluster 4'],
showmeans= True)
plt.xlabel("Clusters")
plt.ylabel("Number of Cases")
plt.show()
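The box plot can be complemented with per-cluster summary statistics via a groupby. A minimal sketch, using made-up counts and labels as a stand-in for `clustered_data`:

```python
import pandas as pd

# Toy stand-in for clustered_data: case counts with cluster labels
toy = pd.DataFrame({
    'count': [120, 340, 90, 800, 60, 410],
    'Cluster Labels': [0, 0, 1, 1, 2, 2],
})

# Median and spread of cases within each cluster
summary = toy.groupby('Cluster Labels')['count'].agg(['median', 'min', 'max'])
print(summary)
```

On the real data, a table like this makes the within-cluster spread visible numerically, reinforcing what the box plot shows.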
From our analysis, we see that across San Francisco, the majority of reported COVID-19 cases occur in the Mission, Bayview Hunters Point, the Excelsior, and the Tenderloin. These tend to be the busier areas of San Francisco where people gather socially. The Mission is a well-known spot for bars, clubs, and Dolores Park. Bayview has a good number of essential businesses, and the Excelsior is home to City College of San Francisco and McLaren Park. Meanwhile, the Tenderloin is a poorer neighborhood with a large homeless population, which makes it susceptible to the spread of COVID-19.
When we clustered the neighborhoods based on the venues present, we obtained 5 clusters based on neighborhood similarity. However, the majority of the neighborhoods fall within 3 clusters, while the other 2 clusters are sparse and could potentially account for outliers.
A limitation of this analysis is that it does not take into account the movement of people. The Bay Area and San Francisco have a phenomenon known as super commuters: individuals who travel a great distance to get to their workplace. This is commonly seen in the lower-income population, who cannot afford to live in San Francisco but work there due to job availability or higher incomes. The inverse is also seen, as many SF residents work for large technology companies around the Bay (Google in Mountain View, Facebook in Menlo Park, Apple in Cupertino, etc.). In this case, they travel and spend most of their days away from their SF homes. This analysis fails to take into account any movement that SF residents may make as part of their jobs, which could lead to an infection occurring elsewhere but being recorded for an SF neighborhood.
This analysis is also limited because it does not take into account the gradual re-opening and the state of the venues in each neighborhood. The assumption is that, in this time period, venues have opened and are now potential sources of infection. However, the analysis does not factor in when each venue re-opened, at what capacity, and how long it has had a chance to be a site of infection.
The purpose of this project is to understand the local spread of COVID-19 in San Francisco under the hypothesis that venues open under the city's re-opening plan are local sites of infection. If this were true, we could understand which businesses and venues pose high risks of infection to their patrons and identify the venue make-up that makes a neighborhood most susceptible to a spike in infections. From this information, policymakers and decision makers could carefully craft guidelines to inform businesses on how they should approach re-opening and what their risks are. This would also inform how the city should prioritize businesses as they re-open in order to maintain public health and safety.
From the data, we see that there is no visible correlation between a neighborhood's venue make-up and its number of confirmed COVID-19 cases. The data shows that the clusters of similar neighborhoods each span a broad range of COVID-19 case counts. Given these results and the limitations listed above, we cannot segregate neighborhoods by COVID-19 risk based on the venues available. Further testing and analysis are needed to draw a conclusion from this data.